Learning (k, l)-Contextual Tree Languages for Information Extraction

Authors

Stefan Raeymaekers

Maurice Bruynooghe

Jan Van den Bussche

Affiliations

K.U.Leuven, Dept. of Computer Science, Celestijnenlaan 200A, B-3001 Leuven, {stefanr,maurice}@cs.kuleuven.ac.be

University of Limburg, Dept. WNI, Universitaire Campus, B-3590 Diepenbeek

Abstract

Learning regular languages from positive examples only is known to be infeasible. A common solution is to define a learnable subclass of the regular languages. In the past, this has been done for regular string languages. Using ideas from those techniques, we define a learnable subclass of the regular unranked tree languages, called the (k,l)-contextual tree languages. We describe the use of this subclass to induce wrappers for Information Extraction from structured documents, such as web pages. Experiments show that our algorithm is able to learn from very little data and compares favorably to similar state-of-the-art approaches.
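The abstract does not define the tree class itself, so the sketch below illustrates only the string-level idea it alludes to: restricting to a subclass that can be learned from positive examples alone. It uses the k-testable string languages (in the strict sense) as a stand-in; whether this is the exact string class the authors build on is an assumption, and the names `KTestable`, `learn`, and `accepts` are hypothetical. The learner simply records which prefixes, suffixes, and length-k substrings occur in the examples, and the acceptor allows any string built only from those pieces.

```python
from typing import Iterable, NamedTuple, Set


class KTestable(NamedTuple):
    """Hypothetical container for a k-testable string language (strict sense)."""
    k: int
    prefixes: Set[str]   # observed prefixes of length k-1
    suffixes: Set[str]   # observed suffixes of length k-1
    factors: Set[str]    # observed substrings of length k
    short: Set[str]      # observed examples shorter than k


def learn(examples: Iterable[str], k: int) -> KTestable:
    """Collect the local pieces seen in positive examples only."""
    if k < 1:
        raise ValueError("k must be at least 1")
    prefixes, suffixes, factors, short = set(), set(), set(), set()
    for w in examples:
        if len(w) < k:
            short.add(w)  # too short to contain any length-k factor
            continue
        prefixes.add(w[:k - 1])
        suffixes.add(w[len(w) - (k - 1):])
        factors.update(w[i:i + k] for i in range(len(w) - k + 1))
    return KTestable(k, prefixes, suffixes, factors, short)


def accepts(model: KTestable, w: str) -> bool:
    """Accept w if every local piece of w was seen during learning."""
    k = model.k
    if len(w) < k:
        return w in model.short
    return (
        w[:k - 1] in model.prefixes
        and w[len(w) - (k - 1):] in model.suffixes
        and all(w[i:i + k] in model.factors for i in range(len(w) - k + 1))
    )


# Two positive examples are enough to generalize to longer strings.
model = learn(["ab", "abab"], k=2)
print(accepts(model, "ababab"))  # built only from observed pieces
print(accepts(model, "abba"))    # contains the unseen factor "bb"
```

The parameter k controls generalization: small values accept many unseen strings, while large values stay close to the examples. The (k,l)-contextual tree languages named in the abstract carry two such parameters, but their precise definition is not given on this page.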


Similar resources

Wrapper Induction: Learning (k,l)-Contextual Tree Languages Directly as Unranked Tree Automata

A (k, l)-contextual tree language can be learned from positive examples only; such languages have been successfully used as wrappers for information extraction from web pages. This paper shows how to represent the wrapper as an unranked tree automaton and how to construct it directly from the examples instead of using the (k, l)-forks of the examples. The former speeds up the extraction, the la...


Parameterless Information Extraction Using (k,l)-Contextual Tree Languages

Recently, several wrapper induction algorithms for structured documents have been introduced. They are based on contextual tree languages and learn from positive examples only but have the disadvantage that they need parameters. To obtain the optimal parameter setting, they use precision and recall. This goes in fact beyond learning from positive examples only. In this paper, a parameter estima...


Information extraction from structured documents using k-testable tree automaton inference

Information extraction (IE) addresses the problem of extracting specific information from a collection of documents. Much of the previous work on IE from structured documents, such as HTML or XML, uses learning techniques that are based on strings, such as finite automata induction. These methods do not exploit the tree structure of the documents. A natural way to do this is to induce tree auto...


Unsupervised Learning of Contextual Role Knowledge for Coreference Resolution

We present a coreference resolver called BABAR that uses contextual role knowledge to evaluate possible antecedents for an anaphor. BABAR uses information extraction patterns to identify contextual roles and creates four contextual role knowledge sources using unsupervised learning. These knowledge sources determine whether the contexts surrounding an anaphor and antecedent are compatible. BABA...


NLP Techniques for Term Extraction and Ontology Population

This chapter investigates NLP techniques for ontology population, using a combination of rule-based approaches and machine learning. We describe a method for term recognition using linguistic and statistical techniques, making use of contextual information to bootstrap learning. We then investigate how term recognition techniques can be useful for the wider task of information extraction, makin...


Publication date: 2005
